3 Vector Data (sf basics)

In this chapter, we will work with the following packages. Before starting any exercises, make sure to run the following code.

if (!require(pacman)) install.packages("packman")

pacman::p_load(tidyverse,
               sf,
               mapview)

Learning objectives:

  1. Understand the data format of vector data

  2. Understand the types and structures of vector data (points, lines, and polygons)

  3. Being able to produce a map using vector data

  4. Being able to manipulate vector data

3.1 Vector Data in GIS

In GIS, vector data is one of the primary ways to represent geographic features on a map. Vector data uses geometric shapes – specifically points, lines, and polygons – to model real-world objects.

Each vector feature can also carry attribute data, which are stored in a table linked to the spatial features. For example, a polygon representing a park might include attributes such as its name, size, or type of vegetation.

Vector data is particularly useful for representing clearly defined boundaries and discrete features, and it allows for detailed spatial analysis and accurate map production.

3.1.1 Data format

Traditionally, vector data has been stored in the ESRI Shapefile format (.shp). This format was originally developed by ESRI in the 1990s and remains widely used across many GIS platforms. However, it is somewhat cumbersome because it is not a single file; shapefiles require at least four associated files (.shp, .shx, .dbf, and .prj) to function properly. These files must be kept together, which makes file management and sharing more error-prone.

In recent years, the GeoPackage format (.gpkg) has gained popularity. It simplifies data storage by combining all necessary components into a single, portable file.

In the context of R, even better, vector data can also be saved using the native RDS format (.rds). This format is not cross-platform compatible with other GIS software, but it offers advantages within R: it is more memory efficient and often results in smaller file sizes. For these reasons, this book primarily uses the .rds format for storing vector data, unless compatibility with external software requires otherwise.

3.1.2 Read/Export vector data

Since it is still common to share data in ESRI Shapefile format, we’ll start by loading a .shp file into R. Then, we’ll save the imported data in .rds format to simplify later steps in our workflow.

The sf::st_read() function from the sf package allows you to import shapefiles (.shp) as well as other standard GIS formats. For this exercise, we’ll use county data in North Carolina, which is saved in the data subdirectory – as mentioned, this data comes with four associated files (check the data subdirectory).

# read a shapefile (e.g., ESRI Shapefile format)
# `quiet = TRUE` just for cleaner output
(sf_nc_county <- st_read(dsn = "data/nc.shp",
                         quiet = TRUE))
## Simple feature collection with 100 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32377 ymin: 33.88212 xmax: -75.45662 ymax: 36.58973
## Geodetic CRS:  WGS 84
## First 10 features:
##         county                       geometry
## 1         ashe MULTIPOLYGON (((-81.47258 3...
## 2    alleghany MULTIPOLYGON (((-81.23971 3...
## 3        surry MULTIPOLYGON (((-80.45614 3...
## 4    currituck MULTIPOLYGON (((-76.00863 3...
## 5  northampton MULTIPOLYGON (((-77.21736 3...
## 6     hertford MULTIPOLYGON (((-76.74474 3...
## 7       camden MULTIPOLYGON (((-76.00863 3...
## 8        gates MULTIPOLYGON (((-76.56218 3...
## 9       warren MULTIPOLYGON (((-78.30849 3...
## 10      stokes MULTIPOLYGON (((-80.02545 3...

The first several lines of the output summarize the contents of an sf object. It contains 100 features (rows), each with one attribute field and a geometry of type MULTIPOLYGON (line Geometry type:), which is commonly used to represent areas like counties. The coordinates are XY two-dimensional (line Dimension:), and the bounding box shows the spatial extent of the data (line Bounding box:). The coordinate reference system (CRS) is WGS 84 (line Geodetic CRS:), a common geographic CRS based on latitude and longitude.

You can export the vector data using sf::st_write(). This function supports writing to multiple formats, including Shapefile and GeoPackage. Check the data subdirectory to see the exported files.

# save as shapefile (overwrites by setting append = FALSE)
st_write(sf_nc_county, 
         dsn = "data/sf_nc_county.shp",
         append = FALSE)

# save as Geopackage (overwrites by setting append = FALSE)
st_write(sf_nc_county, 
         dsn = "data/sf_nc_county.gpkg",
         append = FALSE)

For use within R, it is often convenient to save spatial data in .rds format using the saveRDS() function. This format is compact and efficient, making it ideal for storing intermediate results. Keep in mind, however, that .rds files are not compatible with other GIS software, so you’ll need to convert them to .shp or .gpkg for others using common GIS platforms.

# save as an RDS file (compact and efficient for use within R)
saveRDS(sf_nc_county,
        file = "data/sf_nc_county.rds")

To reload a saved .rds file in R, use the readRDS() function.

# read from an RDS file
sf_nc_county <- readRDS(file = "data/sf_nc_county.rds")

3.1.3 Point

Points represent discrete locations that have no area or length, such as the location of a weather station, a tree, or a city. Each point has a pair of coordinates (latitude and longitude or x and y) that indicate its position.

The sample data used in Chapter 2 is an example of a point vector layer. Let’s take a closer look at this dataset (saved as data/sf_finsync_nc.rds in the shared repository).

(sf_site <- readRDS("data/sf_finsync_nc.rds"))
## Simple feature collection with 122 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.02452 ymin: 34.53497 xmax: -76.74611 ymax: 36.54083
## Geodetic CRS:  WGS 84
## # A tibble: 122 × 2
##    site_id                          geometry
##    <chr>                         <POINT [°]>
##  1 finsync_nrs_nc-10013 (-81.51025 36.11188)
##  2 finsync_nrs_nc-10014 (-80.35989 35.87616)
##  3 finsync_nrs_nc-10020 (-81.74472 35.64379)
##  4 finsync_nrs_nc-10023  (-82.77898 35.6822)
##  5 finsync_nrs_nc-10024 (-77.75384 35.38553)
##  6 finsync_nrs_nc-10027 (-83.69678 35.02467)
##  7 finsync_nrs_nc-10029 (-80.65668 35.16119)
##  8 finsync_nrs_nc-10034 (-82.04497 36.09917)
##  9 finsync_nrs_nc-10041 (-80.46558 35.04365)
## 10 finsync_nrs_nc-10049 (-77.96322 34.58249)
## # ℹ 112 more rows

If you examine the geometry column, you’ll see that it contains pairs of latitude and longitude values with the notation <POINT [°]>, which specify the location of each site. Using this geographic information, we visualized the survey sites on a map in Chapter 2.1. We can map this data with mapview::mapview() function:

mapview(sf_site,
        col.regions = "black", # point's fill color
        legend = FALSE) # disable legend

3.1.4 Line

Lines (also called polylines) represent linear features such as roads, rivers, or trails. A line consists of a sequence of connected points and may include curves or bends. Lines have length, but no area.

Stream lines are an example of line geometries. We will use a sample dataset stored in data/sf_stream.rds, which illustrate stream networks within Guilford county, NC. You can load and inspect it in R as follows:

(sf_str <- readRDS("data/sf_stream.rds"))
## Simple feature collection with 5012 features and 1 field
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -80.04671 ymin: 35.9001 xmax: -79.53284 ymax: 36.25706
## Geodetic CRS:  WGS 84
## First 10 features:
##                          geometry       fid
## 1  LINESTRING (-79.90748 36.18... fid000001
## 2  LINESTRING (-79.89768 36.20... fid000002
## 3  LINESTRING (-79.94756 36.15... fid000003
## 4  LINESTRING (-79.94152 36.25... fid000004
## 5  LINESTRING (-79.94206 36.20... fid000005
## 6  LINESTRING (-79.90003 36.18... fid000006
## 7  LINESTRING (-79.94737 36.20... fid000007
## 8  LINESTRING (-79.87595 36.18... fid000008
## 9  LINESTRING (-79.88022 36.25... fid000009
## 10 LINESTRING (-79.94859 36.25... fid000010

In contrast to the point vector layer introduced earlier, this dataset’s geometry column contains the notation LINESTRING, indicating that the features represent linear geometries—specifically, stream segments. Let’s visualize this data to better understand its structure:

mapview(sf_str,
        color = "steelblue", # line's color
        legend = FALSE) # disable legend

3.1.5 Polygon

Polygons represent areas such as lakes, parks, or country boundaries. A polygon is formed by a closed sequence of lines that define its perimeter, allowing it to enclose a space and have both area and shape. As an example, we’ll use county-level polygon data from North Carolina. Recall that data/sf_nc_county.rds was exported in Section 3.1.2:

(sf_nc_county <- readRDS("data/sf_nc_county.rds"))
## Simple feature collection with 100 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32377 ymin: 33.88212 xmax: -75.45662 ymax: 36.58973
## Geodetic CRS:  WGS 84
## First 10 features:
##         county                       geometry
## 1         ashe MULTIPOLYGON (((-81.47258 3...
## 2    alleghany MULTIPOLYGON (((-81.23971 3...
## 3        surry MULTIPOLYGON (((-80.45614 3...
## 4    currituck MULTIPOLYGON (((-76.00863 3...
## 5  northampton MULTIPOLYGON (((-77.21736 3...
## 6     hertford MULTIPOLYGON (((-76.74474 3...
## 7       camden MULTIPOLYGON (((-76.00863 3...
## 8        gates MULTIPOLYGON (((-76.56218 3...
## 9       warren MULTIPOLYGON (((-78.30849 3...
## 10      stokes MULTIPOLYGON (((-80.02545 3...

In the geometry column, you’ll notice the notation MULTIPOLYGON, which indicates that each feature consists of one or more connected polygons. These are classified as polygon vectors in GIS and are commonly used to represent areas with defined boundaries. Let’s visualize these polygons as well:

mapview(sf_nc_county,
        col.regions = "grey", # polygon's fill color
        legend = FALSE) # disable legend

3.1.6 Mapping

To visualize various types of vector data together, we can use the ggplot2 package with its geom_sf() function, which natively supports spatial features.

We’ll start by plotting just the polygon layer to show the county boundaries:

ggplot() +
  geom_sf(data = sf_nc_county)

Next, we add the line layer, which represents stream networks, on top of the county polygons:

ggplot() +
  geom_sf(data = sf_nc_county) +
  geom_sf(data = sf_str)

Finally, we add the point layer to the map, which marks the survey sites. This completes the map by showing all three vector types together:

ggplot() +
  geom_sf(data = sf_nc_county) +
  geom_sf(data = sf_str) +
  geom_sf(data = sf_site)

3.2 Spatial Data Manipulation

3.2.1 Spatial join

While the map we created earlier provides a good overview, it may appear odd because the stream network is only available for Guilford County, yet the other layers (such as survey sites and county boundaries) span the entire state. To better align the spatial representation, we might want to focus on Guilford County across all layers.

To narrow down the survey sites to only those that fall within Guilford County, we can use the sf::st_join() function from the sf package. This function performs a spatial join, associating attributes from one layer (e.g., counties) to another (e.g., point locations) based on their geographic overlap.

Here, we overlay sf_site (survey sites) with sf_nc_county (county polygons) to attach county information to each point:

sf_site_join <- st_join(x = sf_site, # base layer
                        y = sf_nc_county) # overlaying layer

In the original sf_site data, there was no column identifying the county for each site:

print(sf_site)
## Simple feature collection with 122 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.02452 ymin: 34.53497 xmax: -76.74611 ymax: 36.54083
## Geodetic CRS:  WGS 84
## # A tibble: 122 × 2
##    site_id                          geometry
##    <chr>                         <POINT [°]>
##  1 finsync_nrs_nc-10013 (-81.51025 36.11188)
##  2 finsync_nrs_nc-10014 (-80.35989 35.87616)
##  3 finsync_nrs_nc-10020 (-81.74472 35.64379)
##  4 finsync_nrs_nc-10023  (-82.77898 35.6822)
##  5 finsync_nrs_nc-10024 (-77.75384 35.38553)
##  6 finsync_nrs_nc-10027 (-83.69678 35.02467)
##  7 finsync_nrs_nc-10029 (-80.65668 35.16119)
##  8 finsync_nrs_nc-10034 (-82.04497 36.09917)
##  9 finsync_nrs_nc-10041 (-80.46558 35.04365)
## 10 finsync_nrs_nc-10049 (-77.96322 34.58249)
## # ℹ 112 more rows

After running st_join(), the resulting object sf_site_join now includes additional attributes from the county layer (the county column):

print(sf_site_join)
## Simple feature collection with 122 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -84.02452 ymin: 34.53497 xmax: -76.74611 ymax: 36.54083
## Geodetic CRS:  WGS 84
## # A tibble: 122 × 3
##    site_id                          geometry county     
##  * <chr>                         <POINT [°]> <chr>      
##  1 finsync_nrs_nc-10013 (-81.51025 36.11188) wilkes     
##  2 finsync_nrs_nc-10014 (-80.35989 35.87616) davidson   
##  3 finsync_nrs_nc-10020 (-81.74472 35.64379) burke      
##  4 finsync_nrs_nc-10023  (-82.77898 35.6822) buncombe   
##  5 finsync_nrs_nc-10024 (-77.75384 35.38553) greene     
##  6 finsync_nrs_nc-10027 (-83.69678 35.02467) clay       
##  7 finsync_nrs_nc-10029 (-80.65668 35.16119) mecklenburg
##  8 finsync_nrs_nc-10034 (-82.04497 36.09917) avery      
##  9 finsync_nrs_nc-10041 (-80.46558 35.04365) union      
## 10 finsync_nrs_nc-10049 (-77.96322 34.58249) pender     
## # ℹ 112 more rows

This column isn’t randomly assigned; it reflects the actual geographic relationship: each survey site inherits the attributes of the county polygon it falls within, based on spatial coordinates.

Now that each survey site has a county identifier, we can easily subset the points located within Guilford County using familiar tidyverse syntax:

sf_site_guilford <- sf_site_join %>% 
  filter(county == "guilford")

We can also extract just the Guilford County polygon from the full county dataset:

sf_nc_guilford <- sf_nc_county %>% 
  filter(county == "guilford")

With these filtered layers, we can re-create the map – this time focusing exclusively on Guilford County and its associated stream network and survey sites:

ggplot() +
  geom_sf(data = sf_nc_guilford) +
  geom_sf(data = sf_str) +
  geom_sf(data = sf_site_guilford)

If we wish, we can customize the colors of each layer and apply a clean base theme to enhance the appearance:

ggplot() +
  geom_sf(data = sf_nc_guilford) +
  geom_sf(data = sf_str,
          color = "steelblue") +
  geom_sf(data = sf_site_guilford,
          color = "salmon") +
  theme_bw()

3.2.2 Geometric analysis

In vector data analysis, we can perform various geometric operations such as calculating the length of polylines or the area of polygons. However, it’s important to ensure that the data is using an appropriate CRS because geometric computations assume that features are laid out in a flat, two-dimensional space – geodetic CRS should not be used. In most cases, it is recommended to transform the data to a projected CRS that is suitable for the region of interest.

Length

To calculate the length of polylines, we can use the sf::st_length() function. Before doing so, it’s important to transform the data to a projected CRS. In this case, we’ll use WGS 84 / UTM Zone 17N (EPSG:32617), which is appropriate for our sample data (North Carolina, USA). We can use st_transform() to reproject the spatial object sf_str to the UTM Zone 17N coordinate reference system (EPSG:32617).

(sf_str_proj <- st_transform(sf_str, crs = 32617))
## Simple feature collection with 5012 features and 1 field
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: 585999.8 ymin: 3973842 xmax: 631847 ymax: 4012897
## Projected CRS: WGS 84 / UTM zone 17N
## First 10 features:
##                          geometry       fid
## 1  LINESTRING (598237.5 400502... fid000001
## 2  LINESTRING (599098.8 400681... fid000002
## 3  LINESTRING (594668 4001798,... fid000003
## 4  LINESTRING (595096.9 401232... fid000004
## 5  LINESTRING (595099.3 400762... fid000005
## 6  LINESTRING (598901.7 400556... fid000006
## 7  LINESTRING (594624.7 400737... fid000007
## 8  LINESTRING (601073.4 400501... fid000008
## 9  LINESTRING (600602.6 401252... fid000009
## 10 LINESTRING (594459.4 401249... fid000010

This transformation ensures that length calculations are performed in meters.

v_str_l <- st_length(sf_str_proj)

# print the first 10 elements
head(v_str_l)
## Units: [m]
## [1]   38.45906 2889.03070  145.59723  289.01825  323.75021  237.89378

The object v_str_l contains the length of each stream segment, in the same order as the features in sf_str_proj. We can attach them directly to the spatial object as a new column using dplyr::mutate(). This allows us to retain both the geometry and the calculated lengths within a single data frame structure.

(sf_str_w_len <- sf_str %>% 
  mutate(length = v_str_l))
## Simple feature collection with 5012 features and 2 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -80.04671 ymin: 35.9001 xmax: -79.53284 ymax: 36.25706
## Geodetic CRS:  WGS 84
## First 10 features:
##                          geometry       fid         length
## 1  LINESTRING (-79.90748 36.18... fid000001   38.45906 [m]
## 2  LINESTRING (-79.89768 36.20... fid000002 2889.03070 [m]
## 3  LINESTRING (-79.94756 36.15... fid000003  145.59723 [m]
## 4  LINESTRING (-79.94152 36.25... fid000004  289.01825 [m]
## 5  LINESTRING (-79.94206 36.20... fid000005  323.75021 [m]
## 6  LINESTRING (-79.90003 36.18... fid000006  237.89378 [m]
## 7  LINESTRING (-79.94737 36.20... fid000007  162.69751 [m]
## 8  LINESTRING (-79.87595 36.18... fid000008 1066.18503 [m]
## 9  LINESTRING (-79.88022 36.25... fid000009  624.27443 [m]
## 10 LINESTRING (-79.94859 36.25... fid000010  175.48170 [m]

Now, each feature in sf_str_w_len includes its corresponding length as an attribute, making it easier to visualize or analyze within the same spatial object.

Alternatively, this process can be done in a single step by combining mutate() and st_length() directly.

sf_str_w_len <- sf_str %>% 
  st_transform(crs = 32617) %>%       # transform to projected CRS (utm zone 17n) for accurate length calculation
  mutate(length = st_length(.)) %>%   # calculate length of each feature and store it in a new column
  st_transform(crs = 4326)           # transform back to geographic CRS (wgs84) for consistency with other layers

# # the above code returns identical results with the code below
# sf_str_proj <- st_transform(sf_str, crs = 32617)
# v_str_l <- st_length(sf_str_proj)               
# sf_str <- sf_str %>% 
#   mutate(length = v_str_l)                      

Area

The area of polygons can be calculated in a similar manner using the sf::st_area() function. Again, do not forget CRS transformation before performing geometric calculations.

(sf_nc_county_proj <- st_transform(sf_nc_county, crs = 32617))
## Simple feature collection with 100 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 196602 ymin: 3751723 xmax: 1002267 ymax: 4057841
## Projected CRS: WGS 84 / UTM zone 17N
## First 10 features:
##         county                       geometry
## 1         ashe MULTIPOLYGON (((457533.8 40...
## 2    alleghany MULTIPOLYGON (((478495.8 40...
## 3        surry MULTIPOLYGON (((548866.7 40...
## 4    currituck MULTIPOLYGON (((948208.3 40...
## 5  northampton MULTIPOLYGON (((839954.7 40...
## 6     hertford MULTIPOLYGON (((882486.8 40...
## 7       camden MULTIPOLYGON (((948208.3 40...
## 8        gates MULTIPOLYGON (((898362.4 40...
## 9       warren MULTIPOLYGON (((741807.4 40...
## 10      stokes MULTIPOLYGON (((587556.5 40...

Then, apply st_area() to the projected polygon data:

v_area <- st_area(sf_nc_county)

# print the first 10 elements
head(v_area)
## Units: [m^2]
## [1] 1137120798  610923161 1423161473  694386434 1520383764  967515296

As we saw in the example of stream polylines, the object v_area contains the area of each county. Let’s attach them directly to the spatial object as a new column using dplyr::mutate() again.

(sf_nc_county_w_area <- sf_nc_county %>% 
  mutate(area = v_area))
## Simple feature collection with 100 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32377 ymin: 33.88212 xmax: -75.45662 ymax: 36.58973
## Geodetic CRS:  WGS 84
## First 10 features:
##         county                       geometry             area
## 1         ashe MULTIPOLYGON (((-81.47258 3... 1137120798 [m^2]
## 2    alleghany MULTIPOLYGON (((-81.23971 3...  610923161 [m^2]
## 3        surry MULTIPOLYGON (((-80.45614 3... 1423161473 [m^2]
## 4    currituck MULTIPOLYGON (((-76.00863 3...  694386434 [m^2]
## 5  northampton MULTIPOLYGON (((-77.21736 3... 1520383764 [m^2]
## 6     hertford MULTIPOLYGON (((-76.74474 3...  967515296 [m^2]
## 7       camden MULTIPOLYGON (((-76.00863 3...  615801604 [m^2]
## 8        gates MULTIPOLYGON (((-76.56218 3...  903433860 [m^2]
## 9       warren MULTIPOLYGON (((-78.30849 3... 1179078729 [m^2]
## 10      stokes MULTIPOLYGON (((-80.02545 3... 1232489004 [m^2]

Good, now each feature in sf_nc_county_w_area includes its corresponding area as an attribute. This process can be done in a single step as well.

sf_nc_county_w_area <- sf_nc_county %>% 
  st_transform(crs = 32617) %>%       # transform to projected CRS (utm zone 17n) for accurate area calculation
  mutate(area = st_area(.)) %>%       # calculate area of each polygon and store it in a new column
  st_transform(crs = 4326)            # transform back to geographic CRS (wgs84) for consistency with other layers